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Abstract 

Low-Rank Representation (LRR) has been a significant method for segmenting data that are generated from 
a union of subspaces. It is also known that solving LRR is challenging in terms of time complexity and 
memory footprint, in that the size of the nuclear norm regularized matrix is n-by-n (where n is the number of 
samples). In this paper, we thereby develop a novel online implementation of LRR that reduces the memory 
cost from 0(n 2 ) to 0(pd ), with p being the ambient dimension and d being some estimated rank (d < p <C 
n). We also establish the theoretical guarantee that the sequence of solutions produced by our algorithm 
converges to a stationary point of the expected loss function asymptotically. Extensive experiments on 
synthetic and realistic datasets further substantiate that our algorithm is fast, robust and memory efficient. 
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1 Introduction 

In the past a few years, subspace clustering [VidlO, SC 12] has been extensively studied and has estab¬ 
lished solid applications, for example, in computer vision [EV09] and network topology inference [EBN1 1]. 
Among many subspace clustering algorithms which aim to obtain a structured representation to fit the under¬ 
lying data, two prominent examples are Sparse Subspace Clustering (SSC) [EV09, SEC 14] and Low-Rank 
Representation (LRR) [LLY + 13]. Both of them utilize the idea of self-expressiveness, i.e., expressing each 
sample as a linear combination of the remaining. What is of difference is that SSC pursues a sparse solution 
while LRR prefers a low-rank structure. 

In this paper, we are interested in the LRR method, which is shown to achieve state-of-the-art perfor¬ 
mance on a broad range of real-world problems [LLY + 13]. Recently, [LL14] demonstrated that, when 
equipped with a proper dictionary, LRR can even handle the coherent data - a challenging issue in the 
literature [CR09, CLMW1 1] but commonly emerges in realistic datasets such as the Netflix. 

Formally, the LRR problem we investigate here is formulated as follows [LLY + 13]: 

min — ||Z - YX - E\\ 2 f + IIXH* + A 2 H-Elli • (1.1) 

X,E 2 

Here, Z = (z \. z 2 , • • • , z n ) € M r ' xn is the observation matrix with n samples lying in a /^-dimensional 
subspace. The matrix Y € W ,xn is a given dictionary, E is some possible sparse corruption and Ai and A 2 
are two tunable parameters. Typically, Y is chosen as the dataset Z itself. The program seeks a low-rank 


1 


representation X E M nxri among all samples, each of which can be approximated by a linear combination 
of the atoms in the dictionary Y. 

While LLR is mathematically elegant, three issues are immediately incurred in the face of big data: 

Issue 1 (Memory cost of X). In the LRR formulation (1.1), there is typically no sparsity assumption on X. 
Hence, the memory footprint of X is proportional to n 2 which precludes most of the recently developed 
nuclear norm solvers [LCM10, JS10, AKKS12, HO 14]. 

Issue 2 (Computational cost of ||X||*). Due to the size of the nuclear norm regularized matrix X is n x n, 
optimizing such problems can be computationally expensive even when n is not too large [RFP10]. 

Issue 3 (Memory cost of Y). Since the dictionary size is p x n, it is prohibitive to store the entire dictionary 

Y during optimization when manipulating a huge volume of data. 

To remedy these issues, especially the memory bottleneck, one potential way is solving the problem in 
online manner. That is, we sequentially reveal the samples z 1; Z 2 , ■ ■ ■ ,z n and update the components in X 
and E. Nevertheless, such strategy appears difficult to execute due the the residual term in (1.1). To be more 
precise, we note that each column of X is the coefficients of a sample with respect to the entire dictionary 
Y, e.g., 2 ] ~ Yx \ + e\. This indicates that without further technique, we have to load the entire dictionary 

Y so as to update the columns of X. Hence, for our puipose, we need to tackle a more serious challenge: 

Issue 4 (Partial realization of Y). We are required to guarantee the optimality of the solution but can only 
access part of the atoms of Y in each iteration. 

1.1 Related Works 

There are a vast body of works attempting to mitigate the memory and computational bottleneck of the 
nuclear norm regularizer. However, to the best of our knowledge, none of them can handle Issue 3 and 
Issue 4 in the LRR problem. 

One of the most popular ways to alleviate the huge memory cost is online implementation. [FXY13] 
devised an online algorithm for the Robust Principal Component Analysis (RPCA) problem, which makes 
the memory cost independent of the sample size. Yet, compared to RPCA where the size of the nuclear 
norm regularized matrix is p x n, that of LRR is n x n - a worse and more challenging case. Moreover, 
their algorithm cannot address the partial dictionary issue that emerges in our case. It is also worth mention¬ 
ing that [QVLH14] established another online variant of RPCA. But since we are dealing with a different 
problem setting, i.e., the multiple subspaces regime, it is not clear how to extend then - method to LRR. 

To tackle the computational overhead, [CCS 10] considered singular value thresholding technique. How¬ 
ever, it is not scalable to large problems since it calls singular value decomposition (SVD) in each iteration. 
[JS10] utilized a sparse semi-definite programming solver to derive a simple yet efficient algorithm. Un¬ 
fortunately, the memory requirement of their algorithm is proportional to the number of observed entries, 
making it impractical when the regularized matrix is large and dense (which is the case of LRR). [AKKS12] 
combined stochastic subgradient and incremental SVD to boost efficiency. But for the LRR problem, the 
type of the loss function does not meet the requirements and thus, it is still not practical to use that algorithm 
in our case. 

Another line in the literature explores a structured formulation of LRR beyond the low-rankness. For 
example, [WXL13] provably showed that combining LRR with SSC can take advantages of both methods. 
[LL14] demonstrated that LRR is able to cope with the intrinsic group structure of the data. Very recently, 
[SL16] argued that the vanilla LRR program does not fully characterize the nature of multiple subspaces, 
and presented several effective alternatives to LRR. 
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1.2 Summary of Contributions 

In this paper, we propose a new algorithm called Online Low-Rank Subspace Clustering (OLRSC), which 
admits a low computational complexity. In contrast to existing solvers, OLRSC reduces the memory cost 
of LRR from 0(n 2 ) to O(pd) (d < p <C n). This nice property makes OLRSC an appealing solution for 
large-scale subspace clustering problems. Furthermore, we prove that the sequence of solutions produced 
by OLRSC converges to a stationary point of the expected loss function asymptotically even though only 
one atom of Y is available at each iteration. In a nutshell, OLRSC resolves all practical issues of LRR and 
still promotes global low-rank structure - the merit of LRR. 

1.3 Roadmap 

The paper is organized as follows. In Section 2, we reformulate the LRR program (1.1) in a way which is 
amenable for online optimization. Section 3 presents the algorithm that incrementally minimizes a surrogate 
function to the empirical loss. Along with that, we establish a theoretical guarantee in Section 4. The 
experimental study in Section 5 confirms the efficacy and efficiency of our proposed algorithm. Finally, we 
conclude the work in Section 6 and the lengthy proof is deferred to the appendix. 

Notation. We use bold lowercase letters, e.g. v, to denote a column vector. The £2 norm and norm of 
a vector v are denoted by |n|| 2 and 11 u 11 j respectively. Bold capital letters such as M are used to denote a 
matrix, and its transpose is denoted by M T . For an invertible matrix M, we write its inverse as M 1 . The 
capital letter I r is reserved for identity matrix where r indicates the size. The yth column of a matrix M is 
denoted by rrij if not specified. Three matrix norms will be used: | M\\ t for the nuclear norm, i.e., the sum 
of the singular values, || M\\ F for the Frobenius norm and || M || x for the t\ norm of a matrix seen as a long 
vector. The trace of a square matrix M is denoted by Tr(AT). For an integer n > 0, we use [n] to denote 
the integer set {1,2, • • • , n}. 


2 Problem Formulation 


Our goal is to efficiently learn the representation matrix X and the corruption matrix E in an online manner 
so as to mitigate the issues mentioned in Section 1. The first technique for our purpose is a non-convex 
reformulation of the nuclear norm. Assume that the rank of the global optima X in (1.1) is at most d. Then 
a standard result in the literature (see, e.g., [FHB01]) showed that, 


l Y ,l - 1 

\X L = mm — 

uy,x=uv T 2 


E/|| 2 p + I|V|| 2 p 


( 2 . 1 ) 


where U € M. nxd and V € M nxd . The minimum can be attained at, for example, U 

1 

V = VqSq where X = UoSoVq is the singular value decomposition. 

In this way, (1.1) can be written as follows: 


Ai 

mm — 
U,V,E 2 


Z- YUV t -E 



+ A2 ll-E^II 1 


1 


UqSq and 


( 2 . 2 ) 


Note that by this reformulation, updating the entries in X amounts to sequentially updating the rows of U 
and V. Also note that this technique is utilized in [FXY13] for online RPCA. Unfortunately, the size of U 
and V in our problem are both proportional to n and the dictionary Y is partially observed in each iteration, 
making the algorithm in [FXY13] not applicable to LRR. Related to online implementation, another chal¬ 
lenge is that, all the rows of U are coupled together at this moment as U is left multiplied by Y in the first 
term. This makes it difficult to sequentially update the rows of U. 
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For the sake of decoupling the rows of U , as part of the crux of our technique, we introduce an auxiliary 
variable D = YU, whose size is p x d (i.e., independent of the sample size n). Interestingly, in this way, 
we are approximating the term Z — E with DV T , which provides an intuition on the role of D. Namely, 
D can be seen as a basis dictionary of the clean data, with V being the coefficients. 

These key observations allow us to derive an equivalent reformulation of LRR (1.1): 


Ai 

mm — 
D.uy.E 2 


Z-YUV 1 -E 


2 1 
f + 2 


U 


I 2 + 

IF T 


V 


) + ^2 


\E\ 


t > 


s.t. D = YU. 


(2.3) 


By penalizing the constraint in the objective, we obtain a regularized version of LRR on which our algorithm 
is based: 


Ai 

mm — 
D,U,V,E 2 


Z-DV' -E 


+ 


U 


I 2 + 

If t- 


IV 


) + ^2 


\E\U + 


A 3 


| D-YU 


12 
If 


(2.4) 


Remark 1 (Superiority to LRR). There are two advantages of (2.4) compared to (1.1). First, it is amenable 
for online optimization. Second, it is more informative since it explicitly models the basis of the union of 
subspaces, hence a better subspace recovery and clustering (see Section 5). This actually meets the core 
idea of [LL14] but they assumed Y contains true subspaces whereas we leam the true subspaces. 

Remark 2 (Parameter). Note that A 3 may be gradually increased until some maximum value is attained so 
as to enforce the equality constraint. In this way, (2.4) attains the same minimum as (1.1). Actually, the 
choice of A 3 depends on how much information Y brings for the subspace basis. As we aforementioned, D 
is the basis dictionary of the clean data and is in turn approximated by (or equal to) YU. This suggests that 
the range of D is a subset of that of Y. Asa typical choice of Y = Z, if Z is slightly corrupted, we would 
like to pick a large quantity for A 3 . 

Remark 3 (Connection to RPCA). Due to our explicit modeling of the basis, we unify LRR and RPCA as 
follows: for LRR, D « YU (or D = YU if A 3 tends to infinity) while for RPCA, D = U. That is, 
ORPCA [FXY13] considers a problem of Y = I p whose size is independent of n, hence can be kept in 
memory which naturally resolves Issue 3 and 4. This is why RPCA can be easily implemented in an online 
fashion while LRR cannot. 


Remark 4 (Connection to Dictionary Learning). Generally speaking, LRR (1.1) can be seen as a coding 
algorithm, with the dictionary Y known in advance and X is a desired structured code while other popular 
algorithms such as dictionary learning (DL) [MBPS 10] simultaneously optimizes the dictionary and the 
sparse code. Interestingly, in view of (2.4), the link of LRR and DL becomes more clear in the sense that 
the difference lies in the way how the dictionary is constrained. That is, for LRR we have D ~ YU and U 
is further regularized by Frobenius norm whereas for DL, we have | d,, \ \ 2 < 1 for each column of D. 

Let Zi, Hi, e,, u, , and v, be the ?th column of matrices Z. Y, E, U and V 1 respectively and define 


In addition, let 


£(z,D,v,e ) = f y \\z - Dv - e || 2 + i |H | 2 + A 2 He^ , 
£(z, D) = min £(z, D , v , e). 

v,e 


h{Y,D,UyY.\\\u£ + ^ 


i— 1 


D ~J2yi 


T 


i =1 


h(Y, D) = min h(Y, D, U). 


(2.5) 

( 2 . 6 ) 


(2.7) 

( 2 . 8 ) 


4 














Then (2.4) can be rewritten as: 


n 

min min , ^2£(zi,D,Vi,ei) + h(Y,D,U), 

’ ’ i=l 

which amounts to minimizing the empirical loss function: 


1 77, -i 

fn{D) = - Y, D) + —h(Y , D). 
n z —' n 


2— 1 


(2.9) 


( 2 . 10 ) 


In stochastic optimization, we are interested in analyzing the optimality of the obtained solution with re¬ 
spect to the expected loss function. To this end, we first derive the optimal solutions U*, V* and E* that 
minimize (2.9) which renders a concrete form of the empirical loss function f n (D), hence we are able to 
derive the expected loss. 

Given D, we need to compute the optimal solutions U*, V* and E* to evaluate the objective value of 
f n (D). What is of interest here is that, the optimization procedure of U is totally different from that of V 
and E. According to (2.6), when D is given, each v* and e* can be solved by only accessing the /th sample 
Zj. However, the optimal u* depends on the whole dictionary Y as the second term in h(Y. D, U) couples 
all the Ui s. Fortunately, it is possible to obtain a closed form solution for U* which simplifies our analysis. 
To be more precise, the first order optimality condition for (2.8) gives 


which implies 


dh(Y, D, U) 
dU 


U + X 3 (Y t YU - Y t D) = 0, 


u* 


(a 3 -% + t t t) 1 y t d 

+oo 

a 3^(— x 3 y t yYy t d 


+oo 


T 


As^ T ]T(-A 3 Yr 

- 3=0 

(A 3 -% + yy T ) _1 D. 


D 


Likewise, another component YU* T in (2.7) can be derived as follows: 


YU* t = D - 


1 fl 


n \n 


-I v + —N 

n 


-l 


D, 


( 2 . 11 ) 


( 2 . 12 ) 


(2.13) 


where we denote 

n 

N n = YVrVl- (2-14) 

i— 1 


Recall that m is the /'th column of U 1 . So for each i € [n], we immediately have 


U; = 


WjU+EW) y‘=i DT (^ +1 n N ' 


1=1 


-1 


Vi- 


(2.15) 
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Plugging U* and YU* T back to h(Y, D, U ) gives 


2 


h{Y,D) 




H— N n 
n 


Vi 




+ 


V, 


n 



(2.16) 


Now we derive the expected loss function, which is defined as the limit of the empirical loss function 
when n tends to infinity. If we assume that all the samples arc drawn independently and identically from 
some (unknown) distribution, we have 


1 

lim -Y / £(z i ,D) = E z [£(z,D)\. (2.17) 

n—>oo 77 , L — ' 

2=1 

If we further assume that the smallest singular value of -N n is bounded away from zero (which implies 
N n is invertible and the spectrum of N~ 1 is bounded from the above), we have 

1 1 . n , 

0 < lim — h(Y,D) < lim -77 Co = 0. (2.18) 

n—>-oo fi n —^oo ^ J 

2=1 

Here Co is some absolute constant since D is fixed and s are bounded. Hence, it follows that 

lim -h(Y,D) = 0. (2.19) 

n—>• oo fi 

Finally, the expected loss function is given by 

f{D) = lim f n (D ) = E, [£(z, D)\. (2.20) 


3 Algorithm 

Our OLRSC algorithm is summarized in Algorithm 1. Recall that OLRSC is an online implementation to 
solve (2.10), which is derived from the regularized version of LRR (2.4). The main idea is optimizing the 
variables in an alternative manner. That is, at the 2-th iteration, assume the basis dictionary D t ~i is given, 
we compute the optimal solutions {v t ,e t } by minimizing the objective function i,v,e) over v 

and e. For u t , we need a more carefully designed paradigm since a direct optimization involves loading 
the full dictionary Y (see (2.15)). We will elaborate the details later. Subsequently, we update the basis 
dictionary Dt by optimizing a surrogate function to the empirical loss f n (D). In our algorithm, we need to 
maintain three additional accumulation matrices for which the sizes are independent of n. 

Solving {v t , e t }. We observe that if e is fixed, we can optimize v in closed form: 

v = (K 1T d + Dj-vD t -^j Dj_ 1 (z t -e). (3.1) 

Conversely, given v, the variable e is obtained via soft-thresholding [Don95]: 

e = 5 a 2 /Ai 0* - D t -iv). (3.2) 

Thus, we utilize block coordinate minimization algorithm to optimize v and e. 
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Algorithm 1 Online Low-Rank Subspace Clustering 

Require: Z € W xn (observed samples), Y € M pxn , parameters Ai, A 2 and A 3 , initial basis Do € W >xd , 
zero matrices Mo € M pxd , Ao € R dxd and Bq € M pxd . 

Ensure: Optimal basis D n . 

1 : for t = 1 to n do 

2 : Access the t-th sample z t and the t-th atom y t . 

3: Compute the coefficient and noise: 

{v t ,e t } = arg min £(z t , e), 

v,e 

u t = arg min^ 2 ( 2 /*, Dt-i, M t -\,u). 

U 

4 : Update the accumulation matrices: 

M t <- M t -1 +y t uj, At <r- A t _i +v t vj, B t <-B t -i + (z t -e t )vJ. 


5: Update the basis dictionary: 

\ Tr (D T U>(AiA t + A 3 J d )) - Tr (d t (A 1 B* + A 3 M t )) 

6 : end for 


D t = arg min — 
D t 


Solving u t . The closed form solution (2.15) tells us that it is impossible to derive an accurate estimation of 
ut without the entire dictionary Y. Thus, we have to “approximately” solve it during the online optimization 
procedure 1 . 

Our carefully designed approximate process to solve h(Y, D, U) (2.7) is motivated by the coordinate 
minimization method appealing to h(Y, D. U). As a convention, such method starts with initial guess that 
ut = 0 for all i € [n\ and updates the ttj’s in a cyclic order, i.e., u\, U 2 , • • •, u n , u\, ■ ■ ■. Let us consider 
the first pass where we have already updated u\, u->, ■ ■ ■, u t - 1 and are to optimize over u t for some t > 0. 
Note that since the initial values are zero, u t +1 = u t +2 = ■■■ = «„ = 0. Thereby, the optimal u t is 
actually given by minimizing the following function: 


i 2 (y t ,D,Mt-i,u) = ^ 


\u\ 


+ 


D - M t -1 - y t u 


T 


(3.3) 


where 


t-i 

M t - X = J2yi u J- 

i=1 


(3.4) 


We easily obtain the closed form solution to (3.3) as follows: 

u t = (\\y t \\l + 1/A 3 )-\D - M t -i) T y t . (3.5) 


Now let us turn to the alternating minimization algorithm, where D is updated iteratively rather than 
fixed as in (3.5). The above coordinate minimization process can be adjusted in this scenario as we did 
in Algorithm 1. That is, given D t - 1 , after revealing a new atom y t , we compute u t by minimizing 

'Here, “accurately” and “approximately” mean that when only Dt- 1 , zt and y t are given, whether we can obtain the same 
solution {vt,et, itt} as for the batch problem (2.10). 
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(. 2 {Vti Df~ i, Mt-i,u), followed by updating D t . In this way, when the algorithm terminates, we in essence 
run a one-pass update on u t ’s with a simultaneous computation of new basis dictionary. 

Solving D t . As soon as the past filtration a, u, } f i=l are available, we can compute a new iterate D t 
by optimizing the surrogate function 

gt (D)^j(^l(z i ,D,v i ,e i ) + Y J \\\ui\\l + ^ \\D-M t \\ 2 F ) . (3.6) 

\i=l i= 1 Z ) 

Expanding the first term, we find that D t is given by 

\ Tr ( J D T E>(AiA t + A 3 J d )) - Tr (e> t (A x B t + A 3 M t )) 

= (Ai-Bf + A 3 M i )(AiA i + A 3 Jd) _1 , (3.7) 

where A t = Y^l=i v i v J an d B t = J2i.=i( z i ~ e i) v J ■ We point out that the size of A t is d x d and that of 
is p x d, i.e., independent of sample size. In practice, as [MBPS 10] suggested, one may apply a block 
coordinate descent approach to minimize over D. Compared to the closed form solution given above, such 
algorithm usually converges very fast after revealing sufficient number of samples. In fact, we observe that 
a one-pass update on the columns of D suffices to ensure a favorable performance. See Algorithm 2. 

Algorithm 2 Solving D 

Require: D € W ,Xl1 in the previous iteration, accumulation matrix M, A and B, parameters Ai and A 3 . 
Ensure: Optimal D (updated). 

1 : Denote A = X\ A + A 3 / and B = Ai-B + A 3 M. 

2: repeat 

3: for j = 1 to d do 

4: Update the jth column of D. 

dj E- dj {Daj — bj) 

A jj 

5: end for 

6 : until convergence 


Dt = arg min — 
D t 


Memory Cost. It is remarkable that the memory cost of Algorithm 1 is 0(pd). To see this, note that when 
solving v t and e t , we load the auxiliary variable D t and a sample z t into the memory, which costs 0{pd). 
To compute the optimal u t ’ s, we need to access D t and M t € W >yd . Although we aim to minimize (3.6), 
which seems to require all the past information, we actually only need to record At, Bt and Mt, whose 
sizes are at most 0(pd ) (since d < p). 

Computational Efficiency. In addition to memory efficiency, we further clarify that the the computation in 
each iteration is cheap. To compute {v tl e t }, one may utilize the block coordinate method in [RT14] which 
enjoys linear convergence due to strong convexity. One may also apply the stochastic variance reduced 
algorithms which also ensure a geometric rate of convergence [XZ14, DBL14], The u t is given by simple 
matrix-vector multiplication, which costs 0(pd). It is easy to see the update on the accumulation matrices 
is 0(pd) and that of D t is 0(pd 2 ). 

A Fully Online Scheme. Now we have provided a way to (approximately) optimize the LRR problem (1.1) 
in online fashion. Usually, researchers in the literature will take an optional post-processing step to refine 








the segmentation accuracy, for example, applying spectral clustering on the representation matrix X. In this 
case, one has to collect all the ttj’s and s to compute X = UV which again increases the memory 
cost to 0{n 2 ). Here, we suggest an alternative scheme which admits 0{kd ) memory usage where k is 
the number of subspaces. The idea is utilizing the well-known /.--means on ry’s. There are two notable 
advantages compared to the spectral clustering. First, updating the /c-means model can be implemented in 
online manner and the computation is 0{kd) for each iteration. Second, we observe that v t is actually a 
robust feature for the zth sample. Combining the online /c-mcans with Algorithm 1 , we obtain a fully and 
efficient online subspace clustering scheme where the memory cost is 0(pd). For the reader’s convenience, 
we summarize this pipeline in Algorithm 3. 


Algorithm 3 Fully Online Pipeline for Low-Rank Subspace Clustering 

Require: Z € M pxn (observed samples), Y € M pxn , parameters Ai, A 2 and A 3 , initial basis Do € R pxd , 
zero matrices M 0 € W )Xd , Ao € W !xrl and Bo € M. pxd , number of clusters k, initial centroids 

C <E R dxk . 

Ensure: Optimal basis D n , cluster centroids C, cluster assignments { 01 , 02 , ■ ■ • , o n \. 

1 : Initialize r\ = ri = • • • = = 0. 

2 : for t = 1 to n do 

3: Access the f-th sample z t and the f-th atom y t . 

4: Compute {vt, e t ,u t , D t } by Algorithm 1. 

5: Compute o* = argmin 1<J<fc \\v t — Cj || 2 . 

6 : Update the ot -th center: 

rot r Qt + 1 , 

Tot - 1 ,1 

Cot - c 0t H- Vf. 

Cot Cot 

7: end for 


An Accurate Online Implementation. Our strategy for solving u t is based on an approximate routine 
which resolves Issue 4 as well as has a low complexity. Yet, to tackle Issue 4, another potential way is to 
avoid the variable u t 2 . Recall that we derive the optimal solution U* (provided that D is given) to (2.4) is 
given by (2.12). Plugging it back to (2.4), we obtain 

\\U% = Tv( k DD t (Q n -A 3 - 1 Q^)), 

WD-YU*\\ 2 F = \\D-X 3 - 1 Q n D\\ 2 F , 

where 

Q n = (X 3 - 1 I P + N n )- 1 . 


Here, N n was given in (2.14). Note that the size of Q n is p x p. Hence, if we incrementally compute the 
accumulation matrix N t = V ■ , y,yj, we can update the variable D in an online fashion. Namely, at t-th 
iteration, we re-define the surrogate function as follows: 


gt(D) 


def 1 

t 


J 2*(zi,D,Vi,ei) + ^ 

_ i =1 


D ^Q t D 


2 1 / 

+ - Tr DD t 

F 2 V 




2 We would like to thank the anonymous NIPS 2015 Reviewer for pointing out this potential solution to the online algorithm. 
Here we explain why this alternative can be computationally expensive. 
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Again, by noting the fact that £(zi, D, v^, ef) only involves recording A t and B t , we show that the memory 
cost is independent of sample size. 

While promising since the above procedure avoids the approximate computation, the main shortcoming 
is computing the inverse of a p x p matrix in each iteration, hence not efficient. Moreover, as we will show 
in Theorem 1, although the uf s are approximate solutions, we are still guaranteed the convergence of D t . 

4 Theoretical Analysis 

We make three assumptions underlying our analysis. 

Assumption 1. The observed data are generated i.i.d. from some distribution and there exist constants oo 
and a i, such that the conditions 0 < ao < \\z\\ 2 < a\ and ao < ||y || 2 < «i hold almost surely. 

Assumption 2. The smallest singular value of the matrix jJV ( is lower bounded away from zero. 

Assumption 3. The surrogate functions gt(D) are strongly convex for all t > 0. 

Based on these assumptions, we establish the main theoretical result, justifying the validity of Algo¬ 
rithm 1 . 

Theorem 1. Assume 1, 2 and 3. Let {19 / } JY, be the sequence of optimal bases produced by Algorithm 1. 
Then, the sequence converges to a stationary point of the expected loss function f{D) when t goes to infinity. 

Note that since the reformulation of the nuclear norm (2.1) is non-convex, we can only guarantee that the 
solution is a stationary point in general [Ber99]. We also remark that OLRSC asymptotically fulfills the first 
order optimality condition of (1.1). To see this, we follow the proof technique of Prop.3 in [MMG15] and 
let X = UV T , W, = UU r ,W 2 = VV T , M 1 = M 3 = 0.5 1, M 2 = M 4 = 0.5A X Y T (YX + E- Z). 
Due to our uniform bound (Prop. 7), we justify the optimality condition. See the details in [MMG15]. 

More interestingly, as we mentioned in Section 3, the solution (3.5) is not accurate in the sense that it 
is not equal to that of (2.15) given D. Yet, our theorem asserts that this will not deviate {D t }t >o away 
from the stationary point. The intuition underlying such amazing phenomenon is that the expected loss 
function (2.20) is only determined by l(z,D) which does not involve u t . What is of matter for u t and 
M t is their uniform boundedness and concentration to establish the convergence. Thanks to the carefully 
chosen function £(z, D. M , u) and the surrogate function gt{D), we are able to prove the desired property 
by mathematical induction which is a crucial step in our proof. 

In particular, we have the following lemma that facilitates our analysis: 

Lemma 2. Assume 1 and 2 and 3. Let {Mt}t> o be the sequence of the matrices produced by Algorithm 1. 
Then, there exists some universal constant Co, such that for all t > 0 , 1111 ^ < Co- 

Due to the above lemma, the solution D t is essentially determined by jA t and jB t when t is a very 
large quantity since jM t —> 0. We also have a non-asymptotic rate for the numerical convergence of D t as 
\\D t — D t - 1 1 | 2 = (D(l/t). See Appendix B for more details and a full proof. 

5 Experiments 

Before presenting the empirical results, we first introduce the universal settings used throughout the section. 

Algorithms. For the subspace recovery task, we compare our algorithm with state-of-the-art solvers includ¬ 
ing ORPCA [FXY13], LRR [LLY + 13] and PCP [CLMW11]. For the subspace clustering task, we choose 
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ORPCA, LRR and SSC [EV09] as the competitive baselines. Recently, [LL14] improved the vanilla LRR 
by utilizing some low-rank matrix for Y. We denote this variant of LRR by LRR2 and accordingly, our 
algorithm equipped with such Y is denoted as OLRSC2. 

Evaluation Metric. We evaluate the fitness of the recovered subspaces D (with each column being nor¬ 
malized) and the ground truth L by the Expressed Variance (EV) [XCM10]: 


E V(D,L) 


def Tr (DD t LL t ) 
Ti \LL t ) 


(5.1) 


The value of EV scales between 0 and 1, and a higher value means better recovery. 

The performance of subspace clustering is measured by clustering accuracy, which also ranges in the 
interval [ 0 , 1 ], and a higher value indicates a more accurate clustering. 

Parameters. We set Ai = 1, A 2 = 1 / y/p and A 3 = y/t/p, where t is the iteration counter. These settings 
are actually used in ORPCA. We follow the default parameter setting for the baselines. 


5.1 Subspace Recovery 

Simulation Data. We use 4 disjoint subspaces {Sk}t=i C M p , whose bases are denoted by {Lk }/j. = , € 
W xdk . The clean data matrix Zy. € Sy. is then produced by Zy.= LyRj, where Ily. € W lkXdk . The 
entries of s and /i/.’s are sampled i.i.d. from the normal distribution. Finally, the observed data matrix 
Z is generated by Z = Z + E, where Z is the column-wise concatenation of Zy,\ followed by a random 
permutation, E is the sparse corruption whose p fraction entries are non-zero and follow an i.i.d. uniform 
distribution over [—2,2]. We independently conduct each experiment 10 times and report the averaged 
results. 

Robustness. We illustrate by simulation results that OLRSC can effectively recover the underlying sub¬ 
spaces, confirming that D t converges to the union of subspaces. For the two online algorithms OLRSC and 
ORPCA, We compute the EV after revealing all the samples. We examine the performance under different 
intrinsic dimension dy’s and corruption p. To be more detailed, the dy.’ s are varied from 0.01 p to 0.1 p with 
a step size 0.0 lp, and the p is from 0 to 0.5, with a step size 0.05. 



0.5 

c 0-4 

O 

h.o.3 
| 0.2 
° 0.1 
0 

0.05 0.2 0.35 0.5 0.05 0.2 0.35 0.5 

Rank/Dimension Rank/Dimension 



0.5 

SZ 0.4 
o 

5.03 
| 0.2 
° 0.1 
0 

0.05 0.2 0.35 0.5 

Rank / Dimension 


PCP 


Figure 1: Subspace recovery under different intrinsic dimensions and corruptions. Brighter is better. 
We set p = 100, riy- = 1000 and d = 4/4■ LRR and PCP arc batch methods. OLRSC consistently 
outperforms ORPCA and even improves the performance of LRR. Compared to PCP, OLRSC is competitive 
in most cases and degrades a little for highly corrupted data, possibly due to the number of samples is not 
sufficient for its convergence. 

The results are presented in Figure 1. The most intriguing observation is that OLRSC as an online 
algorithm outperforms its batch counterpart LRR! Such improvement may come from the explicit modeling 
for the basis, which makes OLRSC more informative than LRR. Interestingly, [GQV14] also observed that in 
some situations, an online algorithm can outperform the batch counterpart. To fully understand the rationale 
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behind this phenomenon is an important direction for future research. Notably, OLRSC consistently beats 
ORPCA (an online version of PCP), which may be the consequence of the fact that OLRSC takes into 
account that the data are produced by a union of small subspaces. While PCP works well for almost all 
scenarios, OLRSC degrades a little when addressing difficult cases (high rank and corruption). This is not 
surprising since Theorem 1 is based on asymptotic analysis and hence, we expect that OLRSC will converge 
to the true subspace after acquiring more samples. 

Convergence Rate. Now we test on a large dataset to show that our algorithm usually converges to the 
true subspace faster than ORPCA. We plot the EV curve against the number of samples in Figure 2. Firstly, 
when equipped with a proper matrix Y, OLRSC2 and LRR2 can always produce an exact recovery of the 
subspace as PCP does. When using the dataset itself for Y, OLRSC still converges to a favorable point after 
revealing all the samples. Compared to ORPCA, OLRSC is more robust and converges much faster for hard 
cases (see, e.g., p = 0.5). Again, we note that in such hard cases, OLRSC outperforms LRR, which agrees 
with the observation in Figure 1 . 



Number of Samples (xIO 3 ) 



— OLRSC 
— OLRSC2 
— ORPCA 
•••LRR 
► LRR2 
— PCP 


Number of Samples (xl 0 3 ) 




Figure 2: Convergence rate and time complexity. A higher EV means better subspace recovery. We 
set p = 1000, rik = 5000, dk = 25 and d = 100. OLRSC always converges to or outperforms the batch 
counterpart LRR. For hard cases, OLRSC converges much faster than ORPCA. Both PCP and LRR2 achieve 
the best EV value. When equipped with the same dictionary as FRR2, OLRSC2 also well handles the highly 
coiTupted data (p = 0.5). Our methods are more efficient than the competitors but PCP when p is small, 
possibly because PCP utilizes a highly optimized C++ toolkit while ours are written in Matlab. 


Computational Efficiency. We also illustrate the time complexity of the algorithms in the last panel of 
Figure 2. In short, our algorithms (OLRSC and OLRSC2) admit the lowest computational complexity for 
all cases. One may argue that PCP spends slightly less time than ours for a small p (0.01 and 0.1). However, 
we remark here that PCP utilizes a highly optimized C++ toolkit to boost computation while our algorithms 
are fully written in Matlab. We believe that ours will work more efficiently if properly optimized by, e.g., 
the bias routine. Another important message conveyed by the figure is that, OLRSC is always being orders 
of magnitude computationally more efficient than the batch method LRR, as well as producing comparable 
or even better solution. 

5.2 Subspace Clustering 

Datasets. We examine the performance for subspace clustering on 5 realistic databases shown in Table 1 , 
which can be downloaded from the LibSVM website. For MNIST, We randomly select 20000 samples to 
form MNIST-20K since we find it time consuming to run the batch methods on the entire database. 

Standard Clustering Pipeline. In order to focus on the solution quality of different algorithms, we follow 
the standard pipeline which feeds X to a spectral clustering algorithm [NJW01]. To this end, we collect 
all the u’s and v’s produced by OLRSC to form the representation matrix X = UV . For ORPCA, we 
use RqRq as the similarity matrix [LLY + 13], where R<) is the row space of Z o = Lo’SqRo ar| d Zq is 


12 

































Table 1: Datasets for subspace clustering. 


#classes 

#samples 

#features 

Mushrooms 

2 

8124 

112 

DNA 

3 

3186 

180 

Protein 

3 

24,387 

357 

USPS 

10 

9298 

256 

MNIST-20K 

10 

20,000 

784 


the clean matrix recovered by ORPCA. We run our algorithm and ORPCA with 2 epochs so as to apply 
backward correction on the coefficients (U and V in ours and Ri t in ORPCA). 

Fully Online Pipeline. As we discussed in Section 3, the (optional) spectral clustering procedure needs the 
similarity matrix X, making the memory proportional to n 2 . To tackle this issue, we proposed a fully online 
scheme where the key idea is performing /--means on V. Here, we examine the efficacy of this variant, 
which is called OLRSC-F. 


Table 2: Clustering accuracy (%) and computational time (seconds). For each dataset, the first row 
indicates the accuracy and the second row the running time. For all the large-scale datasets, OLRSC (or 
OLRSC-F) has the highest clustering accuracy. Regarding the running time, our method spends comparable 
time as ORPCA (the fastest solver) does while dramatically improves the accuracy. Although SSC is slightly 
better than SSC on Protein, it consumes one hour while OLRSC takes 25 seconds. 



OLRSC 

OLRSC-F 

ORPCA 

LRR 

SSC 

Mush- 

85.09 

89.36 

65.26 

58.44 

54.16 

rooms 

8.78 

8.78 

8.30 

46.82 

32 min 

DNA 

67.11 

83.08 

53.11 

44.01 

52.23 

2.58 

2.58 

2.09 

23.67 

3 min 

Protein 

43.30 

43.94 

40.22 

40.31 

44.27 

24.66 

24.66 

22.90 

921.58 

65 min 

USPS 

65.95 

70.29 

55.70 

52.98 

47.58 

33.93 

33.93 

27.01 

257.25 

50 min 

MNIST- 

57.74 

55.50 

54.10 

55.23 

43.91 

20K 

129 

129 

121 

32 min 

7 hours 


The results are recorded in Table 2, where the time cost of spectral clustering or /.‘-means is not included 
so we can focus on comparing the efficiency of the algorithms themselves. Also note that we use the dataset 
itself as the dictionary Y because we find that an alternative choice of Y does not help much on this task. 
For OLRSC and ORPCA, they require an estimation on the true rank. Here, we use 5 k as such estimation 
where k is the number of classes of a dataset. Our algorithm significantly outperforms the two state-of-the- 
art methods LRR and SSC both for accuracy and efficiency. One may argue that SSC is slightly better than 
OLRSC on Protein. Yet, it spends 1 hour while OLRSC only costs 25 seconds. Hence, SSC is not practical. 
Compared to ORPCA, OLRSC always identifies more correct samples as well as consumes comparable 
running time. For example, on the USPS dataset, OLRSC achieves the accuracy of 65.95% while that of 
ORPCA is 55.7%. Regarding the running time, OLRSC uses only 7 seconds more than ORPCA - same 
order of computational complexity, which agrees with the qualitative analysis in Section 3. 
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Table 3: Time cost (seconds) of spectral clustering and A-means. 



Mushrooms 

DNA 

Protein 

USPS 

MNIST-20K 

Spectral 

295 

18 

7567 

482 

4402 

/c-means 

2 

6 

5 

19 

91 


More interestingly, it shows that the fc-means alternative (OLRSC-F) usually outperforms the spectral 
clustering pipeline. This suggests that perhaps for robust subspace clustering formulations, the simple k- 
means paradigm suffices to guarantee an appealing result. On the other hand, we report the running time 
of spectral clustering and A>means in Table 3. As expected, since spectral clustering computes SVD for an 
n- by -n similarity matrix, it is quite slow. In fact, it sometimes dominates the running time of the whole 
pipeline. In contrast, fc-means is extremely fast and scalable, as it can be implemented in online fashion. 

5.3 Influence of d 

A key ingredient of our formulation is a factorization on the nuclear norm regularized matrix, which requires 
an estimation on the rank of the X (see (2.1)). Here we examine the influence of the selection of d (which 
plays as an upper bound of the true rank). We report both EV and clustering accuracy for different d under 
a range of corruptions. The simulation data are generated as in Section 5.1 and we set p = 200, rq, = 1000 
and dk = 10. Since the four subspaces are disjoint, the true rank is 40. 




Figure 3: Examine the influence of d. We experiment on d = {2, 20,40, 60, 80,100,120,140,160,180}. 
The true rank is 40. 

From Figure 3, we observe that our algorithm cannot recover the true subspace if d is smaller than the 
true rank. On the other hand, when d is sufficiently large (at least larger than the true rank), our algorithm 
can perfectly estimate the subspace. This agrees with the results in [BM05] which says as long as d is large 
enough, any local minima is global optima. We also illustrate the influence of d on subspace clustering. 
Generally speaking, OFRSC can consistently identify the cluster of the data points if d is sufficiently large. 
Interestingly, different from the subspace recovery task, here the requirement for d seems to be slightly 
relaxed. In particular, we notice that if we pick d as 20 (smaller than the true rank), OFRSC still performs 
well. Such relaxed requirement of d may benefit from the fact that the spectral clustering step can correct 
some wrong points as suggested by [SEC 14]. 


6 Conclusion 

In this paper, we have proposed an online algorithm called OFRSC for subspace clustering, which dramati¬ 
cally reduces the memory cost of ERR from 0(n 2 ) to 0(pd) - orders of magnitudes more memory efficient. 
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One of the key techniques in this work is an explicit basis modeling, which essentially renders the model 
more informative than LRR. Another important component is a non-convex reformulation of the nuclear 
norm. Combining these techniques allows OLRSC to simultaneously recover the union of the subspaces, 
identify the possible corruptions and perform subspace clustering. We have also established the theoretical 
guarantee that solutions produced by our algorithm converge to a stationary point of the expected loss func¬ 
tion. Moreover, we have analyzed the time complexity and empirically demonstrated that our algorithm is 
computationally very efficient compared to competing baselines. Our extensive experimental study on syn¬ 
thetic and realistic datasets also illustrates the robustness of OLRSC. In a nutshell, OLRSC is an appealing 
algorithm in all three worlds: memory cost, computation and robustness. 
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A Proof Preliminaries 


Lemma 3 (Corollary of Thm. 4.1 [BS98]). Let / : F x K ? 4 1. Suppose that for all x G W the function 
f(x, •) is differentiable, and that f and X7 u f(x,u) are continuous on R p X R 9 . Let v(u) be the optimal 
value function v{u) = min xe c f(x,u), where C is a compact subset ofMP. Then v(u) is directionally 
differentiable. Furthermore, if for Uq € R 9 , /(■, Uq) has unique minimizer Xq then v(u) is differentiable in 
u 0 and X7 u v(u 0 ) = V u f(x 0 , u 0 ). 

Lemma 4 (Corollary of Donsker theorem [vdVOO]). Let F = {/# : X —y R, 9 G 0} be a set of measurable 
functions indexed by a bounded subset © ofW 1 . Suppose that there exists a constant K such that 

\fdi(x) ~ fe 2 (x)\ <# 1101 - 02112 , 

for every 6\ and 62 in 0 and x in X. Then, F is P-Donsker. For any f in F, let us define P n f, P/ and G n f 
as 


1 

P nf=~Y,f(Xi), 

i —1 

Ff = E[f(X)f 
G n f = \/n(P n f - P/). 

Let us also suppose that for all f, P/ 2 < 5 2 and 11 /11 ^ < XL and that the random elements X \, X 2 , • • • are 
Borel-measurable. Then, we have 


e||G|| f = 0 ( 1 ), 


where ||G|| F = sup /eF |G n /|. 

Lemma 5 (Sufficient condition of convergence for a stochastic process [Bot98]). Let (il.fF. P) be a mea¬ 
surable probability space, Ut, for t > 0 , be the realization of a stochastic process and Pi be the filtration by 
the past information at time t. Let 


St 


1 if E[ttt +1 - u t | P t \ > 0, 
0 otherwise. 


If for all t, ut > 0 and 1 — Ut)\ < 00 , then ut is a quasi-martingale and converges almost 

surely. Moreover, 


y . \E[u t+ i -u t \Pt]\< +00 a.s. 

t =1 

Lemma 6 (Lemma 8 from [MBPS 10]). Let at, bt be two real sequences such that for all t, at > 0 , bi > 0 , 
SSi °t = °°> a tbt. < 00 , 3iC > 0, such that |6i + i — bf < Ka t . Then, lim^+oo bt = 0. 

B Proof Details 

B.l Proof of Boundedness 

Proposition 7. Let {ut}, {vt}, {e/ } and {Dt} he the optimal solutions produced by Algorithm 1. Then, 
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1. Vt, e t , jA t and \B t are uniformly bounded. 

2. Mt is uniformly bounded. 

3. D t is supported by some compact set T>. 

4. Ut is uniformly bounded. 

Proof. Let us consider the optimization problem of solving v and e. As the trivial solution {v' t ,e' t } = 
{ 0 , z t } are feasible, we have 

li{z t ,D t _i,v' t ,e' t ) = X 2 \\z t \\ 1 . 

Therefore, the optimal solution should satisfy: 

II11 2 T 2 ll^tll 2 X 2 ll^illi — A 2 ll^tlli ) 

which implies 

\ |||| 2 < A 2 ll^tlli, A 2 ||e* || x < A 2 H^tlk • 

Since z t is uniformly bounded (Assumption 1), v t and e t are uniformly bounded. 

To examine the uniform bound for jA t and jB t , note that 

1 A 1 T 

1 A t = -'£v i v i , 

i=l 

2—1 

Since for each i, v\, e, and z, are uniformly bounded, jA t and jB t are uniformly bounded. 

Now we derive the bound for M t . All the information we have is: 

1 . M t = Y!i=i Vi u J (definition of M t ). 

2. u t = (|| y t \\l + i - M t _i) T y t (closed form solution). 

3. D t (X\A t + A;;/) = Ai B t + \:>,Mi (first order optimality condition for D t ). 

4. i A t , \B U \N t arc uniformly upper bounded (Claim 1). 

5. The smallest singular values of jNt ar| d jA t are uniformly lower bounded away from zero (As¬ 
sumption 2 and 3). 

For simplicity, we write D t as: 

D t = (X 1 B t + X 3 M t )Qf\ (B.l) 

where 

Qt = Ai A t + A 3 1 . 

Note that as we assume \A t is positive definite, Q, is always invertible. 
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From the definition of M t and (3.5), we know that 


M t+ 1 - M t = y t+ iu 


t +1 


1 


-i 


= (J|y m || 2 + — J Dt+iVt+i (Dt — M 

= P t D t - P t M t 

= P t (\iB t + A 3 M t )Qj l - P t M t , 


where 


Pt = f ||yt +1 ||2 + y t+iVt+i ■ 

By multiplying Q t on both sides of (B.2), we have 

M t+1 = (M t - Ai P t M t A t Q~ l ) + A iP t B t Q~ l . 
By applying the Taylor expansion on Q t 1 , we have 

1 + °° / X \ i 

QT 1 = ( A 1 A t + As idY 1 = y 3 Y, \Y 3 At ) • 


Thus, 


tQt 1 - uYl 


+oo / , \ I 


As V As 

1 +oo 


(A t ) 


i +1 


Ai 

1 

A7 

1 

Ai 




4=0 
+oo 


A 3 


4+1 


,£+F 


4+1 


-In 


Id + Y 3 At ) + T 1 Ii ' 


So Mt+i is given by 


M t+1 = (I d - P t )M t + P t M t ( I d + '-^A t ) + Ai PtBtQt 1 


Ai 

A 3 


-1 


We first show that P t B t Q t 1 is uniformly bounded. 

-1 


\P t B t QY\\ = 


Pt T^f -Q 


1 


t 


Wt 


< ll^ll • 


-B t 

t 


-Qt 

t 1 


(B.2) 


(B.3) 


(B.4) 


Since we assume that {z t } are uniformly upper bounded (Assumption 1), there exists a constant a±, such 
that for all t > 0, 


\z t \\o < Ctl- 
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So we have 


\Pt+i\\ < 


\^a\ + 1 


Next, as we have shown that jB f can be uniformly bounded, there exists a constant ci, such that for all 
t > 0, 


1 


-B, 


And, 


-Qi 


< Cl. 


1 


(Tmir 


(iQt) 

i 


min i ±A t + ^ I d 


V n 


1 


+ Ai£J m in (jAj) 


< 


A 3 + Ai/3o 

Thus, \\PfB t Qj 1 is uniformly bounded by a constant, say C 2 . That is, 


|Ai P t B t Q t 1 || < C2. 


(B.5) 


It follows that Wt can be bounded: 


^11 < II Pt 


I 


Id + Y At 

A 3 


-1 


+ C2 


Ci 

< o 


oq 


A3 


a\ + ^ A 3 + Ai/3ot 


| Md| + c 2 


C3 


< j\\Mt\\+c 2 , 


(B-6) 


where is derived by utilizing the assumption that z is upper bounded by a \ and the smallest singular 
value of j At is lower bounded by /?o- The last inequality always holds for some uniform constant C3. 

From Assumption 2, we know that the singular values of ~ YU= 1 z i z J should uniformly span the diag¬ 
onal. Thus, there exists a constant r, such that for all i > 0, A ^* +T z y zj is uniformly bounded away from 
zero with high probability. 

Let mi = 11 M \ \ \. Now we pick a constant t *, such that 

C-iT , 1 

^— + 1 <0.5. (B.7) 

t* a 0 

We also have a constant w*, such that for all t <t*, 


W t \\<W*, yTOl + 0.5m* + c 2 < w*. 


(B.8) 
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Based on this, we first derive a bound for all || M t \\, with t < t*. We know that there exists an integer k* 
(which is a uniform constant), such that k*(r + 1) < t* < ( k* + l)(r + 1). Our strategy is to bound ||iV/" t || 
in each interval [( k — l)(r + 1), A:(r + 1)]. We start our reasoning from the first interval [1, r + 1]. 

It is easy to induce from (B.4) that for all t > 0, 


Thus, 


t -1 t 


M t+ i= n( j p - p i) M i +e n + w t- 

3=1 i=3+1 


i =1 


I M 


T+l | 


T— 1 T 


H( - Pi)M 1 + ]T n - p i) w 3 + W * 

3=1 i=3+1 


i =1 



T 


T— 1 

T 

< 

[](f p - Pi)Mi 

+ 

E 

II ( J p - 


2—1 


j=i 

=i+i 


Ci 

< 


C2 


n<w. 


i=l 


\Mi\\ + TW* 


< (1 — «o )m\ + tw* 


Here, (j holds because 1 for all j € [r — 1 ], Q> holds because the singular values 

of Pi s have span over the diagonal so the largest singular value of 0 [ = , (I p — Pi) is 1 — ao, where qq is 
the lower bound for all Zi s (see Assumption 1). 

For M 2 (T+ i), we can similarly obtain 


|M 2 ( t+ i)|| < (1 — a 0 ) 2 mi + (1 - a 0 )rw* + tw* 


More generally, for any integer k < k *, 


M 


k -1 

fc (r+i) || < (! - aofrm + ^(1 

3=0 


ao) J TW* < m i + 


«o 


Hence, we obtain a uniform bound for ||M fc ( r+1 )||, with k E [A;*]. For any i E (( k — l)(r + 1), fc(r + 1)), 
they can simply bounded by 

TW* TW* 

j| Mj]\ < mi H-1- (i — (k — l)(r + l))w* < mi H-h tw*. 

ao ao 

Therefore, for all the current M t ’s, we can bound them as follows: 

TW* 

|| Af*II < H- 1 - tw*, V t = 1, 2, • • • , t*. (B.9) 

ao 

From (B. 8 ) and (B.9), we know that for all t < t*, 

TW* 

|| Wt|| < w*, \\M t \ <mi +-+TW*. 

ao 

Next, we show that the bounds still hold for ||Wt*+i|| and || M t * + \\\, which completes our induction. 
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For ||MV + i ||, it can simply be bounded in the same way as aforementioned because all the Wf s are 
bounded by w* for t < t* + 1. That is, 


TW 


|M t » + i|| < M fc .( r+1 ) + (t* + 1 - k*(r + l))w* <mH-b rw* 

«o 


(B-10) 


For I IFF 


t*+i| 


from (B.6), we know 


\Wf +l \\ < 


C3 


t* + 1 


|-M"t*+l|| + C2 


„ C3 . TW 

< : ( m i H-brm ) + c 2 

t* + 1 ao 

— 1 1 ( b l)tn + c 2 

f* + 1 t* + l ao 

Ci csmi 

< TT-T + °- 5w + c 2 

t* + 1 


C 2 

< w*. 


(B .11) 


Flere, ('| is derived by utilizing (B.7) and ( 2 is derived by (B.8). 

From (B.10) and (B.ll), we know that the bound for ||AT t || and ||FFt||’s, with t < t*, still holds for 
t = t* + 1. Thus we complete the induction and conclude that for all t > 0, we have 

TW 

||M f ||<mi +- +rw*, \\W t \\<w*. 

ot 0 

Thus, M t is uniformly bounded. 

From (B.l), we know that 


D t = Ai B t (Ai A t + A 3 Id)- 1 + A 3 M t (Ai A t + As^)' 1 


— Ai 


B , 


-A, + ^I d 

t t 


-1 


+ T m ‘ 


T^< + ~t~ Id 


-1 


Since \A t , jB t and M t are all uniformly bounded, D t is also uniformly bounded. 

By examining the closed form of u t , and note that we have proved the uniform boundedness of D t and 
Mt, we conclude that {-i^} are uniformly bounded. □ 

Corollary 8. Let Vt, fit, u t find Dt be the optimal solutions produced by Algorithm 1. 


1. £(z t , Dt, vt, et ) and £(zt, Dt) are uniformly bounded. 

2. jh(Z, D, U ) is uniformly bounded. 

3. The surrogate function (ji (Dt ) defined in (3.6) is uniformly bounded and Lipschitz. 

Proof. To show Claim 1, we just need to examine the definition of £(zt , D t , v t , e t ) and notice that z t , D t , 
v t and e t are all uniformly bounded. This implies that £(z t , D t ,v tl e t ) is uniformly bounded and so is 
£(z t , Df). Likewise, we show that jh(Z, D, U) is uniformly bounded. 

The uniform boundedness of gt(D t ) follows immediately as £(z t . D t , v t , e/) and \Ji(Z. D. U) are 
both uniformly bounded. To show that gt{D) is Lipschitz, we show that the gradient of gt{D) is uniformly 
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bounded for all D €V. 


\\Vg t (D)\\ F = 


Ai D 


— + —/A — A x — — —M t 
t t d t t 


f 

A f 


^3 

\ . 



A 3 



+ 

— Id 

) + Ai 


+ 


V 

t 

F 

t 

F/ 

t 

F 

t 


< Ai \\D\ 

Notice that each term on the right side of the inequality is uniformly bounded. Thus the gradient of gt(D) 
is uniformly bounded and gt(D) is Lipschitz. □ 


Proposition 9. Let D G T> and denote the minimizer of l(z. D, v, e) as: 

= argmin £(z,D,v,e). 

v,e 

Then, the function £(z,L) is continuously differentiable and 

V D £(z, D) = (. Dv' + e' - z)v' T . 
Furthermore, £{z. •) is uniformly Lipschitz, 

Proof By fixing the variable z, the function £ can be seen as a mapping: 


([v; e],D) i-> £(z,D,v,e ). 

It is easy to show that for all [t>; e] G ]R d+p , £(z, e) is differentiable. Also £(z, ■, •, •) is continuous on 
M rf+P x V. V d £(z, D , v, e) = (Dv + e — z)v T is continuous on M. d+P x V. X/D G V, since £(z, D,v, e ) 
is strongly convex w.r.t. v, it has a unique minimizer \v'. e'}. Thus Lemma 3 applies and we prove that 
£(z,D) is differentiable in D and 

V d £(z, D ) = ( Dv ' + e' - z)v' T . 

Since every term in V d£{z, D) is uniformly bounded (Assumption 1 and Proposition 7), we conclude that 
the gradient of £(z, D) is uniformly bounded, implying that £(z, D) is uniformly Lipschitz w.r.t. D. □ 


Corollary 10. Let ft(D) be the empirical loss function defined in (2.10). Then f) (D) is uniformly bounded 
and Lipschitz. 


Proof Since £(z,L) can be uniformly bounded (Corollary 8 ), we only need to show that jh(Z,D) is 
uniformly bounded. Note that we have derived the form for h(Z, D) as follows: 


i =1 


D 


x 3 t Ip+ t Nt 


-1 


A3 
H-- 

2 f 3 


-I 

t 


P + Y Nt 


-1 


D 


where N t = 1 UiVl ■ Since every term in the above equation can be uniformly bounded, h(Z. D) is 

uniformly bounded and so is ft(D). 

To show that ft(D ) is uniformly Lipschitz, we show that its gradient can be uniformly bounded. 


1 * 1 

Vft(D) = ~Y^ V£( Zi , D) + -Vh(Z,D) 


i— 1 
t 


\Y J (Dv i + e i -z i )vJ + ^i2\YTf I P + 7 N 


i= 1 

A 3 (1 A 3 

+ fU 7 ' + t jv * 


2—1 


A 3 1 


-1 




-1 


D 


-2 


D. 
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Then the Frobenius norm of V ft{D) can be bounded by: 


1 y \ 

l|V/t(-D )|| F < II Dvi + ei - Zi || 2 • ||ui|| 2 

i —1 



One can easily check that the right side of the inequality is uniformly bounded. Thus \\Vft(D)\\ F is uni¬ 
formly bounded, implying that ft{D) is uniformly Lipschitz. □ 


B.2 Proof of P-Donsker 

Proposition 11. Let ft(D) = |E-=i i{z u D) and f(D) be the expected loss function defined in (2.20). 
Then we have 


nVt\\fi-f\\j = o(i). 

Proof. Let us consider {£{z, D) } as a set of measurable functions indexed by D £ 'D. As we showed 
in Proposition 7, V is a compact set. Also, we have proved that £(z, D) is uniformly Lipschitz over D 
(Proposition 9). Thus, {£(z, D )} is P-Donsker (see the definition in Lemma 4). Furthermore, as £(z, D) is 
non-negative and uniformly bounded, so is £ 2 (z , D). So we have E z [£ 2 (z. D)] being uniformly bounded. 
Since we have verified all the hypotheses in Lemma 4, we obtain the result that 


□ 


B.3 Proof of convergence of g t (D ) 

Theorem 12 (Convergence of the surrogate function g t (D t )). The surrogate function gfiDt) we defined in 
(3.6) converges almost surely, where D t is the solution produced by Algorithm 1. 

Proof Note that gt{D t ) can be viewed as a stochastic positive process since every term in gfiDfi is non¬ 
negative and the samples are drawn randomly. We define 


u t = gt(Df). 
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To show the convergence of iq, we need to bound the difference of ut+i and ?//: 


u t +i - u t 

= 9t+i{D t+ i) — gt(Dt) 

= 9 t+i(D t+ i) - g t+ i(Dt) + gt+i(D t ) - gt{D t ) 

= gt+i(D t+ i ) — g t+ i(D t ) + - ; £(zt+ li D t ) — ^ g't(Dt) 


+ 


t + 1 

• II 2 4- -^3 

2 + 1 ^ 2 " Ui|12 + 2(2 + 1 ) 

2—1 


t + 1 


1 ‘+ 1 i 

— V- 

4 1^5 


|£h-M 


1 


t+l II F 


* 1 

E) 


112 A 3 n r-. n/r 112 
£ / y r> ll' U i|l2 ~ ■ 11-^* 

2—1 


2t 


i IIf 


= 9i+l(Dl+1 ) - + ^‘ +I ’ D ‘ ) - / ‘ (Dt) 


2+1 


+ 


1 m 1 
i- y~ 

-u 1 ^ 9 


^3 


t+l 

t 


..Ui II o + , x 

t + l ^2 11 *" 2 2 t + l 

2=1 x 7 


^3 


\ Dt M t+1 f F t J2 2 \\Ui\r2 2t 

2=1 


\D t -M t \\ z F 


(B.12) 


Here, 


9t( D t) = T ^2i(zi,D,Vi,ei). 


2=1 


First, we bound the last four terms. We have 

,9 —1 


t+l , t 

I II 2 1 

2+1 ' 2 * 2 2 ^ 

2=1 2=1 


—E- 

4-1 Eo 


* 1 
Y- 

2(2 + !) + 2 


| Ui 112 + 


2(2 + 1 ) 


ll^t+1 112 < 


(B.13) 


2(2 + 1 ) 


IK+ill 


2 • 


(B.14) 


Also we have 


A 3 


A 3 


2(2 + 1 ) ^ Dt Mt+ ^ F 2 t^ Dt Mt ^ F 


—A 3 


22 ( 2 + 1 ) 

—A 3 

22 ( 2 + 1 ) 


| D t -M t \\i,+ 


A3 


\Dt — MtW'p + 


F ' 2(2 + 1) 

2 , A3 


2(2 + 1 ) 


z t +iuj +1 


z t +iuj +1 


2 _ 

F 2+1 
2 A 3 
F 2+1 


TV \ (D t - M t ) z t+1 u} +l 

|| z 1+1 ||^ + T )|| Ul+1 ||2 


< 


1 


< 


2+1 

1 

2+1 


A3 

2 


2 t+lWfl 


- (A 3 H^t+111 2 + 1) 111 


-^\\z t+1 \\ 2 2 \\u t+1 \\ 2 2 -\\u t+l \\ 2 2 


(B.15) 


where the first equality is derived by the fact that M t +1 = M t + Zt+iuJ, l5 and the second equality is 
derived by the closed form solution of u t +\ (see (3.5)). 

Combining (B.14) and (B.15), we know that 


t+i 


2 + 1^2 

2=1 


1 " .|| 2 _i+|| w .i | 2 + 
* 2 t 2^ + 


1 iE- ||u,;||: ’ 


2=1 


—|| D t - M t+ i||^, - — 
2 ( 2 + 1 )" + " F 22 


\D t -M t \\ z F 


< 


1 


2(2 + 1 ) 

1 


IK+ 1 II+ 1 


t+l 


-yll*.+i|l2ll“<+i||+IK+i||| 


t+l 


-E IN.+lIll ll«i+l|ll - | ll«*+illl ) <0. 
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Therefore, 


1 


ut+i — ut< g t +i(D t +i) - gt+i(Dt ) + D t ) - y-yy g't(D t ) 

_ ^ ^ , fl(D t )-g'(D t ) , £(z t+ 1 ,D t )-fl(D t ) 

= 9t+i{JJt+i) ~ 9t+i{-L f t) H-r—;-h 


t + 1 


t + 1 


< 


e(z t+1 ,D t )-f’(D t ) 
t + 1 


where f[(D) is defined in Proposition 11, and the last inequality holds because D t+ \ is the minimizer of 
gt+i(D) and g[(D) is a surrogate function of f[(D). 

Let Ft be the filtration of the past information. We take the expectation on the above equation condi¬ 
tioned on Ft- 

E[£(z t+1 ,D t )\F t \- fl(D t ) 


K[u t+ i - u t | F t ] < 
< 


< 


t + 1 

f{Dt) ~ ti(Dt) 

t+1 

wf-nWoo 

t + 1 


From Proposition 11 , we know 


E[||/-/<'IIJ=0(^)- 


Thus, 


E[E[ui+i - u t | F t ] + ] = E[max{E[rq+i - u t \ Ft], 0}] < 


y/t(t + 1) 


where c is some constant. 

Now let us define the index set 


and the indicator 


T = {t | E[u t+ i - ut | F t ] > 0}, 


St = 


1 , if t € T, 
0 , otherwise. 


We have 


y^E[t? t (n t+ i - u t )] = ^E[n t+ i - u t \ 


t =l 


teT 


= y^E[E[n t+ i - u t | F t ]] 
teT 

OO 

= ^E[E[n t+1 - u t | F t ] + ] 

t =l 

< + oo. 

Thus, Lemma 5 applies. That is, gt{Dt) is a quasi-martingale and converges almost surely. Moreover, 

OO 

^|E[rt m - u t | F t ]\ < +oo, a.s. (B.16) 

□ 


t =l 
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B.4 Proof of Convergence of D t 

Proposition 13. Let be the basis sequence produced by the Algorithm 1. Then, 

\\D t+1 -D t \\ F = o(^y (B.17) 

Proof. According the strong convexity of gt.(D) (Assumption 3), we have, 

9t{Dt + i) — gt{D t ) > ^ ||.Df+i — D t \\ 2 F , (B.18) 


On the other hand, 

gt(D t + 1 ) — gt{D t ) 

= dt{D t + 1 ) - gt+i{D t+ i) + g t +i{D t+ i) - g t+ i(D t ) + g t+ i(D t ) — gt(D t ) 

< 9t(D t+ 1 ) — g t+ i(D t+ i) + g t+ i(D t ) — g t (D t ) 

= G t (D t+1 )-Gt(D t ). (B.19) 

Note that the inequality is derived by the fact that gt+i(Dt+i ) — gi+\ (Dt) < 0 , as Dt+\ is the minimiz er 
of g t+1 (D). We denote g t (D) - gt+i(D) by G t (D). 

By a simple calculation, we obtain the gradient of Gt{D): 

\7Gt(D) = Vgt(D)-Vg t+ i(D) 

= \[d (A, A t + A 3 I d ) - (Ai B t + A 3 M t ) 

— JTjTY ^(Ai^.t+i + A 3 Id) — (Ai-Bt+i + X 3 M t +i) 

= \ D ( A ‘' 4 * + ^ - rrr' 4 *+> - tt t i<! ) 

+ ~ Mt+i — A 3 M t 

= t D (tri- 4 '* 1 ~ + rfi 1 *) 

+ Ai(zt+i - ct+OwJ-i - Jfn B t +1 + \ 3 z t+ 1 uJ +1 - j^jM t+1 


26 



















So the Frobenius norm of VGt(D) follows immediately: 


I|vg*gd)|| f 


i 

< - 

t 


+ Ai 


D\\ F ^Ai 

Bt+i 


A 


t +1 


+ A 


t + 1 
D\\ F ^Ai 
Bt+1 


t + 1 
+ A 3 


+ Ai 


v t +ivj +1 


A 3 - - 

H-7 I IT 

F t + 1 


d\\F 


+ Ai 


{z t+ i - e t+1 )vj +1 


z t+i u t+i 


F + ^L||M t+1 || F 


A 


t+i 


t + 1 


t + 1 
+ A 3 


+ Ai 


vt+ivj + 1 


^t+iW+i 


+ 


F 

A 3 


+ Ai 


t(t + 1 ) 


(*t+i - e t+iK+i 

IIf + ll-^t+i IIf) ■ 


We know from Proposition 7 that all the terms in the above equation are uniformly bounded. Thus, there 
exist constants c\, C 2 and C 3 , such that 


II VG t (D)|| F < \ (d ||D|| F + c 2 ) + 

According to the first order Taylor expansion, 

G t (D t+ 1 ) - G t (D t ) = Tr ((A+i - A) T (aD t + (1 - a) A+i)) 

< ll-Dt+i — D t \\ F ■ ||VGt (aD t + (1 — a) D t +i)\\ F , 

where a is a constant between 0 and 1. According to Proposition 7, D t and D t +1 are uniformly bounded, 
so aD t + (1 — a) D t +\ is uniformly bounded. Thus, there exists a constant C4, such that 


\\7Gt (aL t + (1 — a) Tt+i )|| F < — + 


t f(f + l)’ 


resulting in 


G t (D t+1 ) - G t (D t ) < ( ^ + C3 


ID 


t t.(t +1 ) 

Combining (B.18), (B.19) and the above equation, we have 

2c 4 1 2c 3 1 


t +1 


D 


t\\F ' 


Dt+i ~ Dt II t? — 


t\\F 


+ 


Po t p 0 t(t + 1) 


□ 


B.5 Proof for convergence of ft(D t ) 

Theorem 14 (Convergence of f t (Df)). Let ft(D t ) be the empirical loss function defined in (2.10) and D t 
be the solution produced by the Algorithm 1. Let bt = gt(D t ) — ft(Dt). Then, bt converges almost surely 
to 0. Thus, ft(Dt) converges almost surely to the same limit of gt{Df). 
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Proof. Let f' t {D) and g[(D) be those defined in Proposition 11 and Theorem 12 respectively. Then, 

bt = gt{D t ) - f t (D t ) 


= </t(Dt)-fl(D t ) + 


1 V - ' 1 ii ii 2 A 3 .. 1, 2 

+ 0 11 112 2 2 F 


t f-f 2 

L t=l 


ly I 

t 3 ^ 2 

i= 1 


D ' T| ^ + ^ 


1 


1 


-1 


A 3 

22 3 


I 7 " + t n> 


-1 


Dt 


= g' t (D t )-f' t {D t ) + q t {D t ), 

where qt{D t ) denotes the last four terms. Combining B.12, we have 

bt 

t + 1 

_ St (A) - ft(D t ) + qt{D t ) 


t + 1 


£ + 1 


^ , ^ m ,A)-//(A) , 

— St+i(A+i) — St+t(A) H--h ut ~ ut+i 


2 + 1 


+ 


ft(A) 


t+1 


+ 


2 + 1 2 + 1 ^ 2 
2=1 


1 ^ 11 112 , A 3 

1 1 - 11“*^ + 


2(2 + 1 ) 




A3 


'Uji o 

t ^ 2 " 112 22 

2=1 


ID* - M 


t\\F 


Note that we naturally have 

qt(D t ) 


1 t 

2+1 “ 2(2 + 1) ^ 

2=1 


1 I. 11 2 
- 11 Ui 11 2 + 


-II D t - Mt\\l 

22(2 + 1 )" " F 


1 v- 1 ,,2 , c 

~ 2(2 + 1) 2^2 " Ui '' 2 + 22(2 + 1)’ 

where the second inequality holds as D t and AT t are both uniformly bounded (see Proposition 7). 
Also, from (B.14), we know 


-1 t~\~ 1 
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2+1 2 8112 2 . 
2=1 2=1 


1 ,, |. 2 —1 v - ^ ^ 11 11 2 

II 2 = T77T\ z_^ 9 11^® 11 2 + 
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t (*+ 1 ) + 


' Iki+illl 


2(2 + 1 ) 


And from (B.15) 

A3 


2(2 + 1 ) 


12 A 3 
” 22 


D t — M t+ i\\ F — — || D t — M t \\ F < 
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-yll^+illlK+ill+IK+ill" 


Thus, 
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Therefore, 


bt ^ i frt\ , K z t+1- A) - ft{D t ) 

< <74+1 (A+t) — <74+1 (A) H- t i -i -1~ ut ~ u t+ 1 + 


t + 1 


t + 1 


22(2 + 1 ) 


^£(z t+u D t )-fl(D t ) , 

<---h u t - u t+ 1 + 


£ + 1 ' " " _ri ' 22(2 + 1)' 

By taking the expectation conditioned on the past information Tt , we have 

bt / /(A) - /t(A) , T , 1 , 

<-:-b E[ut - itt+i | JTJ + 


t + 1 


< 


2 + 1 11 ^ 1 " J 22(2 + 1) 
+ |E[ttt — ut+ 1 | + 


y/t(t+ 1) ' ^ ±l ‘' tJI ' 22(2+1)’ 

where the second inequality holds by applying Proposition 11. Thus, 


bt ci 

t + 1 ~~ ^ + f) 




4=1 


“( 22(2 + 1) 


< +oo. 


Here, the last inequality is derived by applying (B.16). 

Next, we examine the difference between 6 t +i and b t : 

\bt+! - bt\ = |<?4+i(A+i) - ft+i(D t+1 ) - g t (D t ) + f t (D t )\ 

< |</4+i(A+i) — g t (D t +i)\ + \g t (D t+ i) - gt{D t )\ 

+ |/ t+ i(A+t) - /t(A+i)| + |/t(A+i) - / t (A)| • 

For the first term on the right hand side, 

\gt+i(D t+1 ) - g t (D t+ i)\ 
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<7m(A+i) - g' t {D t+ 1 ) + X] 2 
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22(2 + 1 ) 


I A+i — M f || p + 


A3 
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z 4 +l w t+l 


Cl 
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(B.20) 
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where c\ and C 2 are some uniform constants. Note that (j holds because all the tij’s, D t+ \, M t and z t + 1 are 
uniformly bounded (see Proposition 7), and Q 2 holds because £(z t+ \, D t+1 ) and g' t (D t+1 ) are uniformly 
bounded (see Corollary 8 ). 

For the third term on the right hand side of (B.20), we can similarly derive 


\ft + i(D t+1 ) - f t (D t+ 1 )| < | fl+i(D t+ i) - f t {D t+ 1 )| + ^ T 

= t~+i^ Zt + 1 ’ Dt +i) - 1 ) 


+ 


C3 

t + 1 


< C4 
- t + V 


where C 3 and C 4 are some uniform constants, and <(3 holds as £(zt+i, Dt+i) and f[(D t+ 1 ) are both uni¬ 
formly bounded (see Corollary 10). 

From Corollary 8 and Corollary 10, we know that both gt(D) and ft(D ) are uniformly Lipschitz. That 
is, there exists uniform constants k 2 , such that 


1 9t(Dt+i) ~ 9t(D t ) | <ni H-Dt+i — D t || F < ^ ^ , 

\ft(D t +i) - ft(D t )| <k 2 \\D t+1 - D t \\ F < 

Flere, (4 and ^5 are derived by applying Proposition 13 and k :! and H 4 are some uniform constants. 
Finally, we have a bound for (B.20): 


l^+t - b t | < 


K 0 

t+ 1’ 


where kq is some uniform constant. 

By applying Lemma 6 , we conclude that {bt} converges to zero. That is, 

lim g t (D t ) - f t (D t ) = 0. 

t—»-1-00 

Since we have proved in Theorem 12 that gt(Dt) converges almost surely, we conclude that ft(Dt) con¬ 
verges almost surely to the same limit of gt(Dt). □ 

Theorem 15 (Convergence of Let f(D) be the expected loss function we defined in (2.20) and let 

D t be the optimal solution produced by Algorithm 1. Then f(D t ) converges almost surely to the same limit 
offt(Dt) (or, g t (D t )). 

Proof According to the central limit theorem, we know that \fi(f(D t ) — ft(Df)) is bounded, implying 

lim f(D t ) - f t (D t ) = 0, a.s. 

t —»+oo 


□ 


B.6 Proof of gradient of f(D) 

Proposition 16 (Gradient of f(D)). Let f(D) be the expected loss function which is defined in (2.20). 
Then, f(D) is continuously differentiable and V f(D) = E Z [V d(( z - D)\. Moreover, V f(D) is uniformly 
Lipschitz on V. 
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Proof. We have shown in Proposition 9 that £(z, D ) is continuously differentiable, f(D) is also continu¬ 
ously differentiable and we have V f(D) = E Z [V£>£(2, D)]. 

Next, we prove the Lipschitz of V/(-D). Let v'{z', D') and e'{z', D') be the minimizer of £{z', D', v, e). 
Since £(z, D,v, e ) has a unique minimum and is continuous in z, D , v and e, v'(z’, D') and e'{z', D ') is 
continuous in z and D. 

Let A = {j | e ' 3 f ()}. According the first order optimality condition, we know that 


d£(z, D, v, e ) 
de 


= 0 , 


which implies 


Ai (z — Dv — e) G A2 ||e|| x . 


Hence, 


{z- Dv- e)j\ = Vj € A. 


Since z — Dv — e is continuous in z and D. there exists an open neighborhood V, such that for all 
{z", D") € V, if j A, then \ (z" — D"v" — e")j\ < ^ and e"- = 0. That is, the support set of e! will not 
change. 

Let us denote H = [D I p ], r = [u T e T ] T and define the function 


e(z,H,r A ) = ^\\z-H A r A \\ 2 2 


+ o II °] r ylll2 + ^2 


I]r 


/till 


Since £(z, D A . •) is strongly convex, there exists a uniform constant such that for all r", 


£(z\ H' a , r" A ) - £{z\ H' a , r' A ) > Kl |K - r'£ = «i 


// | | 2 , II _// J M 2 

v - v \\ 2 + \\e A — e, 


-' A \\2 


(B.21) 


On the other hand, 


£{z\ H' a , r" A ) - £(z', H' a , r' A ) = £(z>, H’ a , r’f) - £(z", H% r" A ) + £(z\ H% r" A ) - £(z\ D' a , r' A ) 

< £{z\ H a , r" A ) - £{z\ H% r" A ) + £{z\ H% r' A ) - £(z\ H' a , r' A ), 

(B.22) 


where the last inequality holds because r" is the minimizer of £{z". H". r). 

We shall prove that £(z', H A ,r A ) — £(z". H ' A , r A ) is Lipschitz w.r.t. r, which implies the Lipschitz of 
v'(z, D ) and e\z , D). 


V r ( £(z'i H' a , r A ) — £{z", H a , r A 

= Ai 


Note that \\H' 
C 2 , such that 


A\\F 


H'J(H' a - H" a ) + (H' a - H" a ) t H" a + H'J{z" - z') + (H" a - H[ a ) t z" . 

11 and z" arc all uniformly bounded. Hence, there exists uniform constants ci and 


V r (e{z?,H' A ,r A ) ~ £{z",H" A ,r A )) ^ < a || H\ - H" a \\ f + c 2 \\z> - z h 
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which implies that i(z', H' a , rfi)— £(z", H" a , r/ i) is Lipschitz with Lipschitz constant c(H' A , H" a . z' , z") = 
ci ||i?^ — H" a \\ f + C2 Hz' — z" || 2 . Combining this fact with (B.21) and (B.22), we obtain 


\\ r A ~ r A 


l < c(H' a ,H" 


Ai 


z , z 


\r" 

\ r A 


-r 


2 ‘ 


Therefore, r(z, D) is Lipschitz and so are v(z. D ) and e(z, D). Note that according to Proposition 9, 


V/(D') - V/(D" 


= E ; 
=E Z 


H'r' - z)v' T - (. H"r " - z)v" T 
H'r'(v' - v") T + (H r - H")r'v" T + H"{r' - r")v" T + z{v" - v') T 


Thus, 


Ci 


|V/(-D') — V/(-D ,/ )|| f < E 


\ H ' r '\\ 2 \\ v ' ~ v "\\ 2 + \\ H ' - H "\ 


r’v" T 


+ \\H”\\ F \\r' -r"\\ 2 \\v "\\ 2 + \\z\\ 2 \\v , -v "\\ 2 


(71 + 72 ll- 2 !^) \\h' — h" 


C 2 

<E 2 

II , „ll 

< 70 \\ D - D \\ F 


where 70 , 71 and 72 are all uniform constants. Here, £1 holds because for any function s(z), we have 
I|Ez[s( z )]II.f — E z [||s(z)|| f ]. C 2 is derived by using the result that r(z, H) and v(z, H) are both Lipschitz 
and H’, H", r', r", v f and v" are all uniformly bounded. ^3 holds because z is uniformly bounded and 
actually \\H' — H"\\ f = || D' — D"\\ f . Thus, we complete the proof. □ 


B.7 Proof of stationary point 

Theorem 17 (Convergence of D t ). Let {D t } be the optimal basis produced by Algorithm 1 and let f(D) 
be the expected loss function defined in (2.20). Then D t converges to a stationary point of f(D) when t 
goes to infinity. 

Proof. Since j A t and \B t are uniformly bounded (Proposition 7), there exist sub-sequences of {j A/} and 
{ jB t } that converge to and B 0 0 respectively. Then D t will converge to D 0 Q . Let W be an arbitrary 

matrix in M r ’ x,/ and {/)/,.} be any positive sequence that converges to zero. 

As gt is a surrogate function of ft, for all t and k, we have 

gt{D t + h k W ) > f t {pt + h k W). 

Let t tend to infinity, and note that /(D) = lim^oo ft.(D), we have 

9 oo(Doo + h k W ) > f(D 00 + h k W ). 

Note that the Lipschitz of V/ indicates that the second derivative of /(D) is uniformly bounded. By a 
simple calculation, we can also show that it also holds for gt(D). This fact implies that we can take the first 
order Taylor expansion for both gt{D) and /(D) even when t tends to infinity (because the second order 
derivatives of them always exist). That is, 

T\-(%VL T V 5 oo (D co )) + o(h k W ) > TV(/ lfc W T V/(D 00 )) + o(h k W). 
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By multiplying / tfc ||i^H on b°th s ^ es an d note ih at {^k} is a positive sequence, it follows that 
Tr (> Tr (-±-W T V 


h k \\W\ 


h k \\W\ 


Now let k go to infinity. 


-W T Vp 00 (D 00 ) > Tr 


-W T Vf(D c 


Note that this inequality holds for any matrix W £ M. pxd , so we actually have 

Vpoo(-Doo) = V/(£>oo). 

As Doc is the minimize! - of g^D), we have 


V/(-Doo) = Vpoot-Doo) = 0. 
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